A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases
Abstract
The Web has been rapidly “deepened” by massive online databases: recent surveys show that while the surface Web links billions of static HTML pages, a far more significant amount of information is “hidden” in the deep Web, behind the query forms of searchable databases. With its myriad databases and hidden content, the deep Web is an important frontier for information search. In this paper, we develop a novel Web Form Crawler that collects the “doors” of Web databases, i.e., query forms, to build a database of online databases in an efficient and comprehensive manner. Such a crawler, being object-focused, topic-neutral, and coverage-comprehensive, is critical to searching and integrating online databases, yet it has not been extensively studied. In particular, query forms, while numerous, are sparsely scattered among pages relative to the size of the Web, which raises new challenges for focused crawling: First, because our crawling problem is topic-neutral, we cannot rely on existing topic-focused crawling techniques. Second, existing focused crawlers cannot meet the comprehensiveness requirement because they are unaware of the coverage of the crawled content. As a new attempt, we propose a structure-driven crawling framework based on our observation of the structure locality of query forms: query forms are often close to the root pages of Web sites and reachable by following navigational links. Exploiting this structure locality, we substantiate the structure-driven framework as a site-based Web Form Crawler that first collects site entrances, as the Site Finder, and then searches for query forms within the scope of each site, as the Form Finder. Analytical justification and empirical evaluation of the Web Form Crawler both show that: 1) our crawler maintains stable harvest and coverage throughout the crawl, and 2) compared to page-based crawling, our best harvest rate is about 10 to 400 times higher, depending on the page traversal scheme used.
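To make the two-phase, site-based design concrete, the sketch below pairs a trivial stand-in for the Site Finder (a hard-coded seed list) with a Form Finder that crawls breadth-first within a single site and stops at a small depth, encoding the structure-locality observation that query forms sit near site roots. All names and parameters here (find_forms_in_site, MAX_DEPTH, the seed URL) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the two-phase, site-based crawling idea described
# in the abstract. Names and parameters are assumptions for illustration.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

MAX_DEPTH = 3  # structure locality: forms tend to sit near the site root


class FormAndLinkExtractor(HTMLParser):
    """Collects <form> tags and outgoing hyperlinks from one page."""

    def __init__(self):
        super().__init__()
        self.forms, self.links = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append(attrs.get("action", ""))
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])


def fetch(url):
    """Download a page, returning '' on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""


def find_forms_in_site(root_url, max_depth=MAX_DEPTH):
    """Form Finder: breadth-first search confined to one site,
    cut off at a small depth because of structure locality."""
    site = urlparse(root_url).netloc
    seen, found = {root_url}, []
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if not html:
            continue
        parser = FormAndLinkExtractor()
        parser.feed(html)
        found.extend((url, action) for action in parser.forms)
        if depth < max_depth:
            for href in parser.links:
                nxt = urljoin(url, href)
                # Stay within the site: this is what makes the crawl
                # site-based rather than page-based.
                if urlparse(nxt).netloc == site and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return found


# Site Finder stand-in: in the paper's framework, site entrances are
# discovered automatically; here a fixed seed list takes its place.
for entrance in ["http://example.com/"]:
    for page_url, form_action in find_forms_in_site(entrance):
        print(page_url, "->", form_action)
```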
Similar resources
Web Crawler: A Review
Information Retrieval deals with searching for and retrieving information within documents; it also searches online databases and the Internet. A Web crawler is a program or piece of software that traverses the Web and downloads Web documents in a methodical, automated manner. Based on the type of knowledge used, Web crawlers are usually divided into three types of crawling techniques: General Purpo...
Crawling the Content Hidden Behind Web Forms
The crawler engines of today cannot reach most of the information contained in the Web. A great amount of valuable information is “hidden” behind the query forms of online databases and/or is dynamically generated by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access su...
Research on discovering deep web entries
Ontology plays an important role in locating domain-specific Deep Web contents; therefore, this paper presents a novel framework, WFF, for efficiently locating domain-specific Deep Web databases based on focused crawling and ontology, by constructing a Web Page Classifier (WPC), a Form Structure Classifier (FSC), and a Form Content Classifier (FCC) in a hierarchical fashion; a rough sketch of such a cascade follows. Firstly, WPC discovers potential...
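As a rough illustration of such a hierarchical cascade, the sketch below chains three placeholder predicates so that each stage filters candidates for the next. The heuristics and names are invented for the sketch; the WFF classifiers themselves are not described in this excerpt.

```python
# A minimal sketch of a hierarchical classifier cascade in the spirit of
# WPC -> FSC -> FCC. The predicates are placeholder heuristics, not the
# actual WFF classifiers.
def wpc(page_html: str) -> bool:
    # Web Page Classifier: does the page look like it hosts a form?
    return "<form" in page_html.lower()


def fsc(form_html: str) -> bool:
    # Form Structure Classifier: does the form's structure suggest a
    # searchable query interface?
    lowered = form_html.lower()
    return 'type="search"' in lowered or "<select" in lowered


def fcc(form_html: str, domain_terms: list[str]) -> bool:
    # Form Content Classifier: does the form's content match the domain?
    lowered = form_html.lower()
    return any(term in lowered for term in domain_terms)


def classify(page_html: str, forms: list[str], domain_terms: list[str]) -> list[str]:
    """Run the cascade: pages filtered first, then each form in turn."""
    if not wpc(page_html):
        return []
    return [f for f in forms if fsc(f) and fcc(f, domain_terms)]
```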
A Novel Term Weighing Scheme Towards Efficient Crawl of Textual Databases
The Hidden Web is the vast repository of informational databases available only through search form interfaces, accessed by typing a set of keywords into the search forms. Typically, a Hidden Web crawler is employed to autonomously discover and download pages from the Hidden Web. Traditional hidden Web crawlers do not provide search engines with an optimal search experience because ...
Efficient Web Data Mining with Standard XML Technologies
The problem of Web data extraction and an XML-based methodology whose goal extends far beyond simple “screen scraping” are discussed. An ideal data extraction process is able to digest target Web databases that are visible only as HTML pages and create a local, identical replica of those databases as a result. What is needed in this process is much more than a Web crawler and a set of Web site wrap...